Statistical Science 
2006, Vol. 21, No. 4, 514-531 
DOI: 10.1214/088342306000000439 
In the Public Domain 






Advances in Data Combination, Analysis 
and Collection for System Reliability 
Assessment 

Alyson G. Wilson, Todd L. Graves, Michael S. Hamada and C. Shane Reese 



Abstract. The systems that statisticians are asked to assess, such as 
nuclear weapons, infrastructure networks, supercomputer codes and 
munitions, have become increasingly complex. It is often costly to con- 
duct full system tests. As such, we present a review of methodology 
that has been proposed for addressing system reliability with limited 
full system testing. The first approaches presented in this paper are 
concerned with the combination of multiple sources of information to 
assess the reliability of a single component. The second general set of 
methodology addresses the combination of multiple levels of data to 
determine system reliability. We then present developments for com- 
plex systems beyond traditional series/parallel representations through 
the use of Bayesian networks and flowgraph models. We also include 
methodological contributions to resource allocation considerations for 
system reliability assessment. We illustrate each method with
applications primarily encountered at Los Alamos National Laboratory.

Key words and phrases: Bayesian, Bayesian network, biased data, 
complex system, count data, degradation data, fault tree, flowgraph, ge- 
netic algorithm, lifetime data, logistic regression, Markov chain Monte 
Carlo, Metropolis algorithm, multilevel data, nonhomogeneous Poisson 
process, prior elicitation, reliability block diagram, repairable system, 
resource allocation. 



1. INTRODUCTION 

By definition, reliability is the probability a sys- 
tem will perform its intended function for at least 
a given period of time when operated under some
specified conditions. The systems that we are asked 
to assess are becoming increasingly complex, includ- 
ing, for example, nuclear weapons, infrastructure 
networks, supercomputer codes and munitions. In 
many instances it is not possible to mount vast numbers
of full system tests, and frequently none is available
(Bennett, Booker, Keller-McNulty and Singpurwalla,
2003). Systems reliability methodology is faced
with the challenge of developing models for these 
complex systems and integrating multiple, sometimes 
indirect, sources of information to perform estima- 
tion, make inferences and answer questions about 
the allocation of additional testing resources. 

This paper focuses on four methodological issues 
that arise from complex systems reliability prob- 
lems. In Section 2, we address methods for integrat- 
ing multiple data sources to assess the reliability of 
a single component. The data may come from many
sources, including experimental test results, com- 
puter simulations and expert opinion. In Section 3, 
we consider methods for assessing systems reliability 
when the data are available at multiple levels (e.g., 
both system and component). Again, there may be 
multiple sources of data at each component or at the 
system itself. In Section 4, we discuss Bayesian net- 
works and flowgraph models, which are richer rep- 
resentations that are able to model more systems 
than fault trees or reliability block diagrams can. In 
Section 5 we consider the resource allocation prob- 
lem for systems. Section 6 summarizes our view of 
some of the current research challenges in systems 
reliability assessment. 

The analyses presented here follow hierarchical
Bayesian approaches and focus on estimating the
reliability $R(t)$, in most cases as a function of time.
We will write $R(t \mid \Theta)$ to denote reliability given
unknown parameters $\Theta$, and after obtaining a posterior
distribution $\pi(\Theta \mid D)$ for $\Theta$ based on data $D$,
estimates of $R(t)$ can be obtained from, for example,
the posterior mean $\int R(t \mid \Theta)\, \pi(\Theta \mid D)\, d\Theta$. In each
example we use these meanings for $R$, $\pi$, $\Theta$ and $D$.
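
In practice the posterior mean integral is usually approximated by averaging $R(t \mid \Theta)$ over posterior draws of $\Theta$. The following is a minimal sketch of that step, not the authors' implementation: the one-parameter exponential model, the function names and the simulated "draws" are all hypothetical stand-ins for real MCMC output.

```python
import math
import random


def posterior_summary(reliability_fn, t, draws):
    """Posterior mean and 5th/95th percentiles of R(t | theta) over the draws."""
    values = sorted(reliability_fn(t, theta) for theta in draws)
    n = len(values)
    mean = sum(values) / n
    lower = values[int(0.05 * (n - 1))]
    upper = values[int(0.95 * (n - 1))]
    return mean, lower, upper


def exponential_reliability(t, rate):
    # Hypothetical one-parameter model: R(t | rate) = exp(-rate * t).
    return math.exp(-rate * t)


rng = random.Random(1)
# Stand-in for MCMC output: failure-rate draws from a gamma
# distribution with shape 5 and scale 0.01 (an assumption for illustration).
draws = [rng.gammavariate(5.0, 0.01) for _ in range(5000)]
mean, lower, upper = posterior_summary(exponential_reliability, 10.0, draws)
print(round(mean, 3), round(lower, 3), round(upper, 3))
```

The same averaging applies to any of the models below once draws of $\Theta$ are available; only `reliability_fn` changes.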

2. INTEGRATING MULTIPLE DATA SOURCES 
TO ASSESS COMPONENT RELIABILITY 

In this section, we consider the assessment of com- 
ponent reliability when multiple data sources are 
available. Ideally, we would like a large set of pass/fail
tests or failure time observations to estimate the re- 
liability of a component. We are often in situations 
where this is not the case, but we are able to sup- 
plement our data with other information sources. 
In this section, we consider specifically degradation 
data, surrogate data and a biased sample of pass/fail 
data. 

2.1 Degradation and Failure Time Data 

An important practical example is the case where 
failure time data are augmented with degradation 
data. Suppose that we are interested in the lifetime 
distribution of a component. In the past we have 
observed $n_1$ failures at times $T_j$ for $j = 1, \ldots, n_1$.
A further $n_2$ components are still functioning and
their ages are $A_j$ for $j = n_1 + 1, \ldots, n_1 + n_2$. Finally,
$n_3$ components were destructively tested and
these tests yielded the continuous measurements $Y_j$
at ages $t_j$ for $j = n_1 + n_2 + 1, \ldots, n_1 + n_2 + n_3$. The
$Y_j$ tend to decrease with age and it is thought that



this decrease is closely related to the eventual failure 
of the components. 

We seek to analyze these data simultaneously us- 
ing a hierarchical Bayesian approach by first assum- 
ing that the degradation process satisfies 

$$
Y_j \sim \mathrm{Normal}(\alpha - \beta_j^{-1} t_j, \sigma_y^2).
$$

This assumption implies that components are identi- 
cal at birth, although measurement error is present 
even when testing new units. Differences in com- 
ponents arise later as each is allowed to degrade 
at its own rate $\beta_j^{-1}$, and we assume that
$\log \beta_j \sim \mathrm{Normal}(\mu, \sigma_b^2)$. We estimate both $\mu$ and $\sigma_b$;
$\mu$ has a normal prior distribution and $\sigma_b$ has a gamma prior
distribution. The measurement error standard deviation
$\sigma_y$ is also given a gamma prior distribution. To
relate this degradation process to the failure times, 
assume that a critical lower level L exists and that 

$$
T_j = \inf\{t > 0 : \alpha - \beta_j^{-1} t < L\} = (\alpha - L)\beta_j,
$$

so that $\log T_j \sim \mathrm{Normal}(\mu + \log(\alpha - L), \sigma_b^2)$. In this
problem the reliability is defined to be the survivor 
function of a generic lifetime T, P{T > t}. The level 
L can be given a prior distribution and estimated; 
in most cases the value of the degradation process 
that is required for successful performance will be 
approximately known, so that this prior distribution
will be informative. We assume that $L/\alpha \sim \mathrm{Beta}(a, b)$.
The lognormal distribution for $T_j$ defines
the likelihood for both the censored lifetimes and 
the observed lifetimes. This yields the unnormalized 
posterior distribution 

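$$
\begin{aligned}
\pi(\Theta \mid D)
&= \pi(\alpha, \beta, \sigma_b, \mu, \sigma_y, L \mid T, A, Y) \\
&\propto \phi\!\left(\frac{\alpha - m_\alpha}{s_\alpha}\right)
   \phi\!\left(\frac{\mu - m_\mu}{s_\mu}\right)
   \sigma_y^{s_{\sigma_y} - 1} \exp(-r_{\sigma_y} \sigma_y)\,
   \sigma_b^{s_{\sigma_b} - 1} \exp(-r_{\sigma_b} \sigma_b) \\
&\quad \cdot \alpha^{-1} (L/\alpha)^{a-1} (1 - L/\alpha)^{b-1} \\
&\quad \cdot \prod_{j=1}^{n_1} (\sigma_b T_j)^{-1}
   \phi\!\left(\{\log T_j - \mu - \log(\alpha - L)\}/\sigma_b\right) \\
&\quad \cdot \prod_{j=n_1+1}^{n_1+n_2}
   \left(1 - \Phi\!\left[\sigma_b^{-1}\{\log A_j - \mu - \log(\alpha - L)\}\right]\right) \\
&\quad \cdot \prod_{j=n_1+n_2+1}^{n_1+n_2+n_3}
   \left[(\sigma_b \beta_j)^{-1} \phi\!\left(\{\log \beta_j - \mu\}/\sigma_b\right)
   \sigma_y^{-1} \phi\!\left(\{Y_j - \alpha + \beta_j^{-1} t_j\}/\sigma_y\right)\right],
\end{aligned}
\tag{1}
$$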

where $\phi$ and $\Phi$ denote the standard normal density
and distribution functions, respectively, and where
$m_\alpha$, $s_\alpha$, $m_\mu$, $s_\mu$, $s_{\sigma_y}$, $r_{\sigma_y}$, $s_{\sigma_b}$, $r_{\sigma_b}$, $a$ and $b$ denote fixed
quantities that define prior distributions for $\alpha$ and
other parameters. Samples from this unnormalized 
posterior distribution can be drawn using a variable- 
at-a-time random walk Metropolis algorithm. 
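
As an illustration of the sampling scheme just mentioned, the following is a minimal sketch of a variable-at-a-time random walk Metropolis sampler. It is a generic sketch under stated assumptions rather than the authors' code: the function names, the step sizes and the two-parameter toy target are hypothetical, and in an application `log_post` would return the logarithm of an unnormalized posterior density such as the one above.

```python
import math
import random


def metropolis_one_at_a_time(log_post, init, step_sizes, n_iter, seed=0):
    """Variable-at-a-time random walk Metropolis sampler.

    log_post returns the log of an unnormalized density and should
    return -math.inf outside the support.
    """
    rng = random.Random(seed)
    current = list(init)
    current_lp = log_post(current)
    draws = []
    for _ in range(n_iter):
        for k, step in enumerate(step_sizes):
            proposal = list(current)
            proposal[k] += rng.gauss(0.0, step)  # symmetric proposal
            proposal_lp = log_post(proposal)
            # Accept with probability min(1, density ratio).
            log_u = math.log(1.0 - rng.random())
            if log_u < proposal_lp - current_lp:
                current, current_lp = proposal, proposal_lp
        draws.append(list(current))
    return draws


def toy_log_post(theta):
    # Hypothetical target: independent Normal(1, 1) and Normal(-2, 0.5^2).
    x, y = theta
    return -0.5 * (x - 1.0) ** 2 - 0.5 * ((y + 2.0) / 0.5) ** 2


draws = metropolis_one_at_a_time(toy_log_post, [0.0, 0.0], [1.0, 0.5], 20000)
kept = draws[2000:]  # discard burn-in
mean_x = sum(d[0] for d in kept) / len(kept)
mean_y = sum(d[1] for d in kept) / len(kept)
print(round(mean_x, 2), round(mean_y, 2))
```

Updating one coordinate at a time keeps each proposal one-dimensional, so a separate step size can be tuned for each parameter.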

As an example, consider a simulated population 
of items at time 20 years after fabrication. We have 
observed four failures, all in the last two years, and 
76 items have survived to this point. We also have 
one degradation data point per year up to year 20. 
The data were simulated under $\alpha = 100$, $L = 20$ and
$\mu = \log(0.35) = -1.05$; this implies that the degradation
curve will cross level $L$ at age
$0.35(100 - 20) = 28$ years. Other parameters of the simulation
include $\sigma_b = 0.2$ and $\sigma_y = 5$. In our prior distributions,
we used $\alpha \sim \mathrm{Gamma}(4, 1/30)$ (with mean
120), $\sigma_y \sim \mathrm{Gamma}(4, 1/2.5)$, $\sigma_b \sim \mathrm{Gamma}(4, 5)$,
$\mu \sim \mathrm{Normal}(0, 1)$, and $L/\alpha \sim \mathrm{Uniform}(0, 1)$. The results
are shown in Figure 1. The solid curve is the true reliability
(survivor function) $R(t)$, the dashed curve
is $\int \Phi(\{\mu + \log(\alpha - L) - \log t\}/\sigma_b)\, \pi(\Theta \mid D)\, d\Theta$, the posterior mean
of the survivor function, and the dotted curves are
the 5th and 95th percentiles of the posterior distri- 
bution. There is substantial uncertainty in the re- 
liability just a few years into the future, but this 
is considerably better than could be obtained using 
the (mostly censored) failure times alone. Posterior 
estimates (and 90% posterior probability intervals) 
for the parameters are 99.2 (92.9, 105.1) for $\alpha$, 17.6
(2.3, 34.6) for $L$, $-1.00$ ($-1.21$, $-0.76$) for $\mu$, 6.57
(3.8, 10.3) for $\sigma_y$ and 0.24 (0.14, 0.35) for $\sigma_b$. This
approach has the advantage that the threshold pa- 
rameter L does not need to be known with certainty 
and can be estimated; doing so can provide a diag- 
nostic for the value historically assumed for L. The 
approach can also benefit from strong prior infor- 
mation about L, which might come from physical 
or engineering knowledge used to define the require- 
ments for the component. 
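
The true reliability curve in this example follows directly from the lognormal lifetime model, $R(t) = \Phi(\{\mu + \log(\alpha - L) - \log t\}/\sigma_b)$. The sketch below evaluates it at the simulation values quoted above; the function names are our own, and the ages printed are chosen only for illustration.

```python
import math


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reliability(t, alpha, L, mu, sigma_b):
    """P(T > t) when log T ~ Normal(mu + log(alpha - L), sigma_b^2)."""
    return std_normal_cdf((mu + math.log(alpha - L) - math.log(t)) / sigma_b)


# Simulation values quoted in the text.
alpha, L, mu, sigma_b = 100.0, 20.0, math.log(0.35), 0.2

# Median lifetime: exp(mu) * (alpha - L).
median_life = math.exp(mu) * (alpha - L)

for age in (20, 25, 28, 30, 35):
    print(age, round(reliability(age, alpha, L, mu, sigma_b), 3))
```

The median lifetime equals the 28-year crossing age computed in the text, so the true reliability at age 28 is one half.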

2.2 Bernoulli and Quality Assurance Data 

Anderson-Cook et al. (2005) applied ideas from 
medical statistics to combine pass/fail data with 
component quality assurance data to get more pre- 
cise reliability estimates. Anderson-Cook et al. (2005) 



actually worked in a system context but here we dis- 
cuss the single component variant of the problem; 
see Section 3.2 for the system extension. A compo- 
nent undergoes destructive pass/fail testing at var- 
ious ages. Suppose that age is the only covariate 
of interest, although the model is general enough 
to allow multiple covariates. Further suppose that 
the component can also be tested destructively for 
adherence to up to J published specifications. We 
assume that each such test related to the $j$th specification
($j = 1, \ldots, J$) yields (possibly after transformation)
a normally distributed measurement with
mean $\alpha_j + \delta_j t$ and variance $\gamma_j^2$ if the test is conducted
at age $t$. It is thought that these specification
measurements are related to the component's
performance in a pass/fail test, and we assume that 
the measurements have been transformed so that 
large values of the measurement are thought to be 
good. We now invoke an assumption to relate the 
two types of data. This assumption is inspired by 
the concept of surrogate variables in medical studies 
(Prentice, 1989; Pepe, 1992). Suppose that it were 
possible to obtain a system test Y on the same unit 
where we obtained a full set of specification measurements
$Z_1, \ldots, Z_J$. Then we assume

$$
\Pr\{Y = 1 \mid Z, t\} = \prod_{j=1}^{J} \Phi\!\left(\frac{Z_j - \theta_j}{\sigma_j}\right)
$$

independently of t. In this model, each of the J 
quantities represented in specifications is indepen- 
dently capable of causing failure, and it is not possi- 
ble, for example, for two quantities with somewhat 
low values to collaboratively cause failure. If the lat- 
ter behavior is desired, it is possible to replace the 
product with a multivariate normal integral. Here $\theta_j$
and $\sigma_j$ are unknown, given prior distributions, and
estimated. Their prior distributions can be informa- 
tive if the published specifications are thought to be 
highly relevant to reliability. The key result, since 
it is impossible to observe the Zj for a component 
that undergoes pass/fail testing, is that the Zj can 
be integrated out, so that 



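$$
R(t \mid \Theta) = \Pr\{Y = 1 \mid t, \Theta\}
= \prod_{j=1}^{J} \Phi\!\left(\frac{\alpha_j + \delta_j t - \theta_j}{\sqrt{\gamma_j^2 + \sigma_j^2}}\right).
\tag{2}
$$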



Terms like this can be multiplied by normal density 
terms that reflect the specification measurements 
so as to combine the two sources of data. Assuming
that the data consist of system tests $Y_1, \ldots, Y_m$
taken at ages $t_1, \ldots, t_m$ and specification measurements
$Z_1, \ldots, Z_n$ taken at ages $r_1, \ldots, r_n$, where measurement
$Z_i$ corresponds to the $k_i$th specification,
the likelihood function is

$$
L(\alpha, \delta, \gamma, \sigma, \theta \mid Y, Z)
= \prod_{i=1}^{m} R(t_i \mid \Theta)^{Y_i} \{1 - R(t_i \mid \Theta)\}^{1 - Y_i}
\cdot \prod_{i=1}^{n} \gamma_{k_i}^{-1}\,
\phi\!\left(\frac{Z_i - \alpha_{k_i} - \delta_{k_i} r_i}{\gamma_{k_i}}\right).
$$
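
The step that integrates out the unobserved specification measurements rests on the identity $E[\Phi((Z - \theta)/\sigma)] = \Phi((m - \theta)/\sqrt{\gamma^2 + \sigma^2})$ for $Z \sim \mathrm{Normal}(m, \gamma^2)$. The sketch below checks this numerically for a component with two specifications; all parameter values and function names are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pass_prob_closed_form(t, alpha, delta, gamma, theta, sigma):
    """Product over specifications of the closed-form pass probability."""
    prob = 1.0
    for a, d, g, th, s in zip(alpha, delta, gamma, theta, sigma):
        prob *= normal_cdf((a + d * t - th) / math.sqrt(g * g + s * s))
    return prob


def pass_prob_numeric(t, alpha, delta, gamma, theta, sigma, n_grid=4001):
    """Same quantity, integrating Phi((z - theta_j) / sigma_j) against the
    Normal(alpha_j + delta_j * t, gamma_j^2) density with the trapezoid rule."""
    prob = 1.0
    for a, d, g, th, s in zip(alpha, delta, gamma, theta, sigma):
        mean = a + d * t
        lo, hi = mean - 10.0 * g, mean + 10.0 * g
        h = (hi - lo) / (n_grid - 1)
        total = 0.0
        for i in range(n_grid):
            z = lo + i * h
            weight = 0.5 if i in (0, n_grid - 1) else 1.0
            total += weight * normal_cdf((z - th) / s) * normal_pdf((z - mean) / g) / g
        prob *= total * h
    return prob


# Hypothetical values for J = 2 specifications, evaluated at age t = 5.
args = (5.0, [10.0, 5.0], [-0.2, -0.1], [1.0, 0.5], [6.0, 2.0], [0.8, 0.4])
closed = pass_prob_closed_form(*args)
numeric = pass_prob_numeric(*args)
print(closed, numeric)
```

The two values agree to numerical precision, which is what allows pass/fail tests and specification measurements taken on different units to share parameters in one likelihood.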



2.3 Biased and Unbiased Samples 

Graves et al. (2006) discussed a challenging prob- 
lem whose solution could be applied in a reliability 
context because it involves the estimation of preva- 
lence of a feature in a stratified population. A pop- 
ulation of items was manufactured in lots, and it 
was of interest to estimate the fraction of items in 
each lot with a certain feature. There was reason 
to believe that feature prevalence had a nonzero, 
but imperfect, relationship with lot membership, so 
the authors assumed that if the $j$th lot was of size
$N_j$, the number of features $K_j$ in the lot had a
$\mathrm{Binomial}(N_j, p_j)$ distribution, where the $p_j$ had a
hierarchical prior $p_j \sim \mathrm{Beta}(a, b)$, with $a$ and $b$ given
prior distributions. Some of the lots were inspected
using random (hypergeometric) sampling: a sample
of size $n_j^r$ was taken from the $j$th lot for inspection
and $y_j^r$ features were found. These data alone
can be analyzed using a Markov chain Monte Carlo
(MCMC) algorithm to obtain samples from the joint
distribution of $(a, b, p, K)$. However, some other feature
data were available from items selected using
nonrandom sampling (a "convenience sample" ) ; the 
selection process may or may not have been inde- 
pendent of feature presence. To combine these two 
sources of data, one needs to model this nonran- 
dom sampling, and Graves et al. (2006) used the 
extended hypergeometric distribution. In fact, the 
convenience samples were taken before the random 
samples. Denote by $n_j^c$ and $y_j^c$ the sample size and
number of features found from the $j$th lot in the convenience
sample. Then the extended hypergeometric
model is

$$
P(y_i^c = y) = \frac{\binom{K_i}{y} \binom{N_i - K_i}{n_i^c - y}\, \theta^{y}}
{\sum_{j=\max(0,\, n_i^c - N_i + K_i)}^{\min(n_i^c,\, K_i)} \binom{K_i}{j} \binom{N_i - K_i}{n_i^c - j}\, \theta^{j}}.
$$

Fig. 1. Reliability estimates with uncertainty bands for the degradation and failure time data integration example. The solid curve is the true reliability function, the dashed curve is the posterior mean and the dotted curves are the 5th and 95th percentiles of the posterior distribution.

When the unknown biasing parameter $\theta = 1$, this is
the hypergeometric model; for $\theta > 1$, items with the
feature are more likely to be sampled, and so forth.
Graves et al. (2006) assumed that the amount of 
biasing is constant in each lot ($\theta$ does not depend
on the lot), put a lognormal prior distribution on $\theta$
and estimated the amount of biasing. Their data set 
turned out to be inconclusive about the direction of 
the bias. The likelihood for the randomly sampled 
data is 

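$$
y_i^r \sim \mathrm{Hypergeometric}(K_i', N_i' - K_i', n_i^r),
$$

which is to say

$$
P(y_i^r = y) = \frac{\binom{K_i'}{y} \binom{N_i' - K_i'}{n_i^r - y}}{\binom{N_i'}{n_i^r}},
$$

where $N_i' = N_i - n_i^c$ and $K_i' = K_i - y_i^c$. Graves et
al. (2006) sampled from the resulting posterior distribution
of $(p, K, a, b, \theta)$ using YADAS. Integrating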
the convenience samples with the randomly sampled 
data enabled a more precise estimate of the quantity 
of interest, the prevalence of features among the
unsampled items, $f(K) = \sum_i (K_i - y_i^c - y_i^r) / \sum_i (N_i - n_i^c - n_i^r)$,
without making unwarranted assumptions
such as the prevalence of features being the same in 
each lot or the convenience sampling being done in- 
dependently of feature presence. In a simplified case 
where items that lack the feature have reliability 1 
and items with the feature have reliability 0, the posterior
mean reliability is the integral of $1 - f(K)$ with
respect to the posterior distribution of $K$.
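
The extended hypergeometric probabilities for the convenience sample are simple to compute for a single lot. The following is a minimal sketch with hypothetical lot values; the function name and arguments are our own and are not taken from YADAS or from Graves et al. (2006).

```python
from math import comb


def extended_hypergeometric_pmf(y, N, K, n, theta):
    """P(y features in a sample of size n) for a lot of size N holding K
    features, with biasing parameter theta (theta = 1: no bias)."""
    lo = max(0, n - N + K)
    hi = min(n, K)
    if y < lo or y > hi:
        return 0.0
    weights = {
        j: comb(K, j) * comb(N - K, n - j) * theta ** j
        for j in range(lo, hi + 1)
    }
    return weights[y] / sum(weights.values())


# Hypothetical lot: 20 items, 6 with the feature, convenience sample of 5.
N, K, n = 20, 6, 5
unbiased = [extended_hypergeometric_pmf(y, N, K, n, 1.0) for y in range(n + 1)]
biased = [extended_hypergeometric_pmf(y, N, K, n, 3.0) for y in range(n + 1)]
mean_unbiased = sum(y * p for y, p in enumerate(unbiased))
mean_biased = sum(y * p for y, p in enumerate(biased))
print(round(mean_unbiased, 3), round(mean_biased, 3))
```

With `theta = 1.0` the probabilities reduce to the ordinary hypergeometric ones, and a value above 1 shifts mass toward samples containing more featured items.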

Further study is required before one can recom- 
mend using a more informative prior for the amount 
of bias $\theta$. It is difficult to relate the parameter to
knowledge about the sampling process in a quanti- 
tatively precise manner. If the biasing mechanism 
is better understood, that mechanism should be ex- 
plicitly included in the model rather than the ap- 
proach given here. 

3. ASSESSING SYSTEM RELIABILITY WITH 
MULTILEVEL DATA 

In Section 2, we discussed combining multiple data 
sources to assess a single component. In this sec- 
tion, we consider combining multiple sources of data 
in a system reliability assessment. In particular, we 
consider situations where we have data about both 
components and combinations of components — for 



example, about the entire system. Hamada et al. 
(2004) developed models for the case of a fault tree 
with binary data at basic, intermediate and top events.
Here we give examples of combining failure time 
data, failure count data, Bernoulli data and degra- 
dation data. 

3.1 Logistic Regression, Weibull Lifetimes and 
Degradation 

As an example of integrating multilevel reliability 
data, we work with a variant of an analysis discussed 
in Graves and Hamada (2005). The system con- 
sists of three components combined in series, and all 
three components may see degrading performance 
with age. For component 1, we have binary test data 
at various ages and we assume a logistic regression 
relationship for the success probability as a function 
of age. If $X_1$ denotes a generic component 1 of age $t$
(centered) and $X_1 = 1$ denotes component success,

$$
\operatorname{logit} \Pr\{X_1 = 1\} = \theta_0 + \theta_1 t.
$$

We assume independent normal priors for $\theta_0$ and $\theta_1$,
and in our simulated data, we have 25 tests each at 
ages 0, 2, 4, 6, 8, 10, 15 and 20, with one failure at 
age 4, two at age 15 and six at age 20. 

Component 2 is assumed to have a Weibull life- 
time distribution with 

$$
\Pr\{T_2 > t\} = \exp(-\lambda_0 t^{\lambda_1}),
$$

where $T_2$ denotes a generic lifetime for component
2. Component 2 is said to work properly in a test if 
its life has not yet ended at the time it is tested. We 
observe eight uncensored lifetimes ranging from 14.1 
years to 33.5 years, with 13 lifetimes right-censored 
at 20 years and four right-censored at 40 years. 

Our data for component 3 mirror the analysis
in Section 2.1: we have ten total pieces of degradation
data taken every two years [these data are
normal with mean $\alpha - \beta_j^{-1} t_j$ and variance $\sigma_y^2$, with
$\log \beta_j \sim \mathrm{Normal}(\mu, \sigma_b^2)$] and 80 lifetimes, all but two
of them censored at 20 years. [The logs of these data
are normal with mean $\mu + \log(\alpha - D)$ and variance
$\sigma_b^2$.] This time, we assume that $D = 20$ is known
with certainty. Finally, we also have binomial sys- 
tem test data (15 tests each at ages 0, 5, 10, 15 and 
20, with one failure at age zero and three at age 
20). Since this is a series system, the probability of 
system success for a system of age t is then 

$$
R(t \mid \Theta) = \operatorname{logit}^{-1}(\theta_0 + \theta_1 t) \cdot \exp(-\lambda_0 t^{\lambda_1})
\cdot \left\{1 - \Phi\!\left(\{\log t - \mu - \log(\alpha - D)\}/\sigma_b\right)\right\}.
$$

The component data sets can be analyzed together
with the system data by multiplying all the likeli- 
hood functions with the prior distributions for all 
the unknown parameters. Again, samples from the 
posterior distribution can be drawn using a variable- 
at-a-time random walk Metropolis algorithm, and 
setting up the problem is straightforward in YADAS 
(Graves, 2003). YADAS can handle much larger systems
(e.g., Johnson, Graves, Hamada and Reese,
2003, for the case of pass/fail data with no aging
at all levels). The user can specify the system structure
in a file and component data can take many
forms, assuming only that the user can express the 
success probability at each component as a function 
of unknown parameters. Figure 2 displays the re- 
sults of the analysis. For each component and for 
the full system, we display the mean and 5th and 
95th percentiles of the posterior distribution of reli- 
ability as a function of age. Component 2 dominates 
the unreliability at early ages, while the other two 
components are bigger concerns at later ages. 
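
The series-system reliability of this section is the product of the three component success probabilities at a common age. The sketch below evaluates it for one parameter vector; the numerical values are hypothetical (they are not posterior estimates from the paper), and the age is left uncentered for simplicity, which is an assumption of this illustration.

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def system_reliability(t, theta0, theta1, lam0, lam1, mu, alpha, D, sigma_b):
    """Series-system reliability at age t > 0 for the three-component example."""
    r1 = inv_logit(theta0 + theta1 * t)  # logistic regression component
    r2 = math.exp(-lam0 * t ** lam1)     # Weibull lifetime component
    # Lognormal lifetime implied by the degradation model.
    r3 = 1.0 - normal_cdf((math.log(t) - mu - math.log(alpha - D)) / sigma_b)
    return r1 * r2 * r3


# Hypothetical parameter values, for illustration only.
params = dict(theta0=4.0, theta1=-0.1, lam0=1e-4, lam1=2.0,
              mu=math.log(0.35), alpha=100.0, D=20.0, sigma_b=0.2)

for age in (5.0, 15.0, 25.0):
    print(age, round(system_reliability(age, **params), 3))
```

Evaluating this function at each posterior draw and summarizing across draws gives curves of the kind shown in Figure 2.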

3.2 Combining Partially Informative System 
Tests with Component Tests 

Anderson-Cook et al. (2005) analyzed data from 
a system in which the system pass/fail testing data 



provide incomplete information about which component(s)
were responsible for a failure. In particular,
for the $i$th test, the data consist of a set $C_1(i)$ of
components known to have worked, a second set of
components $C_2(i)$ known to have failed, and a third
set of components $C_3(i)$, where it is known that at
least one component in that set failed. Anderson-Cook
et al. (2005) did this in the context of combining
these system tests with component specification
testing data (see Section 2.2). In a multiple component
context, denote by $p_{ik}$ the probability in (2)
that component $k$ works properly in test $i$. Then the
probability of observing data $(C_1(i), C_2(i), C_3(i))$ given
these component success probabilities is

$$
\prod_{k \in C_1(i)} p_{ik} \prod_{k \in C_2(i)} (1 - p_{ik})
\left\{1 - \prod_{k \in C_3(i)} p_{ik}\right\},
$$

where the third product is understood to equal 0 if
it is empty (the other products are 1 if empty). Results
obtained by Anderson-Cook et al. (2005) for a
two-component series system are shown in Figure 3.
Denoting by $R_i(t \mid \Theta)$ the reliability of component $i$
given in expression (2), the posterior mean system
reliability is $\int R_1(t \mid \Theta) R_2(t \mid \Theta)\, \pi(\Theta \mid D)\, d\Theta$.

Fig. 2. Reliability estimates and uncertainty intervals for the three-component system. Upper left: Component 1, which has logistic regression data. Upper right: Component 2, with Weibull failure time data. Lower left: Component 3, with both degradation data and lognormal failure time data. Lower right: The full series system with all four data sets.

Since the data are proprietary, both axes (time and reliability)
have been rescaled to $[0,1]$. The black curves
show the integration of the two types of data (poste- 
rior means, 5th and 95th percentiles of the posterior 
distribution). The solid and dashed curves show the 
previous methodology used by the system engineers: 
logistic regression using full system data only. The 
component test data are in this case available for 
older components, which greatly tightens the uncer- 
tainty bounds for older systems (dotted lines). (This 
analysis depicts a small subsystem of the full system, 
and none of the components in the small subsystem 
appears to age significantly.) 
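
The probability of a partially informative test outcome given earlier in this section can be evaluated directly from the component success probabilities. A minimal sketch follows; the component labels and probabilities are hypothetical, and the function name is our own.

```python
def outcome_probability(p, worked, failed, at_least_one_failed):
    """Probability of one test outcome, given component success probabilities p.

    worked: components known to have worked (C_1)
    failed: components known to have failed (C_2)
    at_least_one_failed: components among which at least one failed (C_3)
    """
    prob = 1.0
    for k in worked:
        prob *= p[k]
    for k in failed:
        prob *= 1.0 - p[k]
    if at_least_one_failed:  # an empty third set contributes a factor of 1
        all_work = 1.0
        for k in at_least_one_failed:
            all_work *= p[k]
        prob *= 1.0 - all_work
    return prob


# Hypothetical two-component series system.
p = {"A": 0.9, "B": 0.8}
system_pass = outcome_probability(p, ["A", "B"], [], [])
system_fail_unknown = outcome_probability(p, [], [], ["A", "B"])
a_ok_b_failed = outcome_probability(p, ["A"], ["B"], [])
print(system_pass, system_fail_unknown, a_ok_b_failed)
```

A passing series-system test puts every component in the first set, while a failure with no diagnostic information puts every component in the third set; the two probabilities sum to one.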

This is a form of "autopsy data." Meilijson (1994) 
used the expectation-maximization algorithm to ob- 
tain maximum likelihood estimates for failure time 
distribution parameters from the failure time of the 
system and the set of components that failed by 
that time. Gasemyr and Natvig (2001) worked with 
lifetime data, where the set of failed components 
is identified when the system fails and some com- 
ponents are monitored either at all times or from 
certain time points onward (if a component fails 
while being monitored, its failure time is observed 
exactly). They also observed systems that did not 
fail before a censoring time. They derived expres- 
sions for the likelihood function under general sys- 
tem structures, including the case of dependent fail- 
ures, and identified conjugate prior distributions in 
the case that failure times follow generalized gamma 
distributions. 

3.3 Nonhomogeneous Poisson Process 

Highly clustered modern supercomputers are ex- 
amples of systems composed of many similar sys- 
tems in series. Ryan and Reese (2005) presented a 
model for the reliability of a Los Alamos National 
Laboratory supercomputer that consists of 48 highly 
similar computers. While they are often referred to 
as massively parallel, a job that begins on n = 48 
components in a cluster will finish only if all n com- 
ponents function correctly for the duration of the 
computational task. Essentially, these 48 computers 
behave as 48 repairable systems in series. Figure 4 
plots the cumulative number of failures versus time 
for each of the 48 computers. There is one "outlying" 
computer with considerably more failures. In partic- 
ular, computer 21 is different in both structure and 
usage. 

Because this is a repairable system, we seek to
establish a stochastic point process, $N(a, b)$, for the
number of failures in an interval $(a, b]$. We further
define $N(t)$ as the number of failures in $(0, t]$. An
important class of models for failure times of a repairable
system is that of the nonhomogeneous Poisson
processes (NHPP). An NHPP is defined by its
nonnegative intensity $\nu(t)$. Under an NHPP:

• The process $N(a, b)$ is a Poisson random variable
with mean $\mu(a, b) = \int_a^b \nu(t)\, dt$.

• The processes $N(a_1, b_1)$ and $N(a_2, b_2)$ are independent
if $(a_1, b_1]$ and $(a_2, b_2]$ are disjoint (i.e.,
either $b_1 \le a_2$ or $b_2 \le a_1$).

Power law process (PLP) and loglinear process 
models are common choices for the intensity function
$\nu(t)$. Ryan and Reese (2005) introduced an extended
model that includes a positive parameter $\rho$ to
model appropriate asymptotic behavior. They considered
intensities of the form

$$
\nu(t) = \frac{\phi}{\eta} \left(\frac{t}{\eta}\right)^{\phi - 1} + \rho.
$$

When $\phi < 1$, the system undergoes reliability growth
and has a limiting failure rate of $\rho$. (The intensity
never increases or levels off to a constant value, regardless
of the choice of $\phi$.)

We present a hierarchical Bayesian model for the 
Poisson process that governs these data. Assume 
that the number of failures experienced by computer 
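$i$ in month $j$, $X = [x_{ij}]$ (a $C \times M$ matrix), has probability
mass function

$$
p(X \mid \eta, \phi, \rho) = \prod_{i=1}^{C} \left[ \prod_{j=1}^{M} \frac{1}{x_{ij}!}
\left\{ \left(\frac{tj}{\eta_i}\right)^{\phi_i} - \left(\frac{t(j-1)}{\eta_i}\right)^{\phi_i} + t\rho_i \right\}^{x_{ij}} \right]
\exp\left\{ -\left(\frac{Mt}{\eta_i}\right)^{\phi_i} - Mt\rho_i \right\},
$$

where $t$ is the length of one month, so that month $j$ covers the interval $(t(j-1), tj]$.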
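Next, allow for a gamma prior distribution on $\eta$ that
is parameterized in terms of the mean $\mu_\eta$ and standard
deviation $\sigma_\eta$. That is, use

$$
p(\eta \mid \mu_\eta, \sigma_\eta) = \prod_{i=1}^{C}
\frac{(\mu_\eta/\sigma_\eta^2)^{(\mu_\eta/\sigma_\eta)^2}}{\Gamma((\mu_\eta/\sigma_\eta)^2)}\,
\eta_i^{(\mu_\eta/\sigma_\eta)^2 - 1} \exp\left(-\frac{\mu_\eta \eta_i}{\sigma_\eta^2}\right).
$$

Similarly, let

$$
p(\phi \mid \mu_\phi, \sigma_\phi) = \prod_{i=1}^{C}
\frac{(\mu_\phi/\sigma_\phi^2)^{(\mu_\phi/\sigma_\phi)^2}}{\Gamma((\mu_\phi/\sigma_\phi)^2)}\,
\phi_i^{(\mu_\phi/\sigma_\phi)^2 - 1} \exp\left(-\frac{\mu_\phi \phi_i}{\sigma_\phi^2}\right)
$$

and

$$
p(\rho \mid \mu_\rho, \sigma_\rho) = \prod_{i=1}^{C}
\frac{(\mu_\rho/\sigma_\rho^2)^{(\mu_\rho/\sigma_\rho)^2}}{\Gamma((\mu_\rho/\sigma_\rho)^2)}\,
\rho_i^{(\mu_\rho/\sigma_\rho)^2 - 1} \exp\left(-\frac{\mu_\rho \rho_i}{\sigma_\rho^2}\right).
$$

Fig. 3. Comparison of two reliability estimation procedures. Left: Logistic regression on full system data alone. Right: Integration of component specification tests and partially informative system tests. Shown are posterior medians and 5th and 95th percentiles of posterior distributions.

Fig. 4. Empirical cumulative failure counts of 48 components.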


Fig. 5. Posterior distributions of six-hour reliability versus start time. Shown are the posterior median and 0.05 and 0.95 posterior quantiles.

This hierarchical specification assumes a priori conditional
independence of the computer-specific parameters.
This assumption is not as restrictive as
that of complete independence. In fact, a posteriori,
the parameters will reflect dependence as manifested
by the data. As such, we are willing to make this
assumption.

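The distribution of $(\eta, \phi, \rho \mid \mu_\eta, \sigma_\eta, \mu_\phi, \sigma_\phi, \mu_\rho, \sigma_\rho)$ has
density

$$
p(\eta, \phi, \rho \mid \mu_\eta, \sigma_\eta, \mu_\phi, \sigma_\phi, \mu_\rho, \sigma_\rho)
= p(\eta \mid \mu_\eta, \sigma_\eta)\, p(\phi \mid \mu_\phi, \sigma_\phi)\, p(\rho \mid \mu_\rho, \sigma_\rho).
$$

Finally, let

$$
\begin{aligned}
\mu_\eta &\sim \mathrm{Weibull}(a_{\mu_\eta}, b_{\mu_\eta}), \\
\sigma_\eta &\sim \mathrm{Weibull}(a_{\sigma_\eta}, b_{\sigma_\eta}), \\
\mu_\phi &\sim \mathrm{Weibull}(a_{\mu_\phi}, b_{\mu_\phi}), \\
\sigma_\phi &\sim \mathrm{Weibull}(a_{\sigma_\phi}, b_{\sigma_\phi}), \\
\mu_\rho &\sim \mathrm{Weibull}(a_{\mu_\rho}, b_{\mu_\rho}), \\
\sigma_\rho &\sim \mathrm{Weibull}(a_{\sigma_\rho}, b_{\sigma_\rho}).
\end{aligned}
$$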
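
The gamma priors above are indexed by a mean and a standard deviation rather than the more usual shape and rate. The following sketch shows the conversion, shape $= (\mu/\sigma)^2$ and rate $= \mu/\sigma^2$, together with the corresponding log density; the function names are our own and the numerical values are purely illustrative.

```python
import math


def gamma_shape_rate(mean, sd):
    """Shape and rate of the gamma distribution with the given mean and sd."""
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate


def gamma_log_density(x, mean, sd):
    """Log density at x > 0 of a gamma distribution indexed by mean and sd."""
    shape, rate = gamma_shape_rate(mean, sd)
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


shape, rate = gamma_shape_rate(2.0, 1.0)
print(shape, rate)                       # shape and rate for mean 2, sd 1
print(gamma_log_density(0.7, 1.0, 1.0))  # mean 1, sd 1 is the unit exponential
```

With mean and standard deviation both equal to 1 the distribution is the unit exponential, which gives a quick check of the formula.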

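For the $i$th computer, $N_i(a, b)$ is a Poisson random
variable with mean $\mu_i(a, b)$. Thus, the probability
that the $i$th computer has no failures in $(a, b]$ is

$$
P(N_i(a, b) = 0 \mid \phi_i, \eta_i, \rho_i)
= \exp\left\{ \left(\frac{a}{\eta_i}\right)^{\phi_i} + \rho_i a - \left(\frac{b}{\eta_i}\right)^{\phi_i} - \rho_i b \right\}.
$$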



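We use an operational definition of reliability: the
probability that a job of length $l$ run on a computer of age
$s$ finishes without computer failure. Since the supercomputer
is a series system in its 48 components,
reliability $R(l, s \mid \Theta)$ is

$$
R(l, s \mid \Theta) = \prod_{i=1}^{48}
\exp\left\{ \left(\frac{s}{\eta_i}\right)^{\phi_i} - \left(\frac{s + l}{\eta_i}\right)^{\phi_i} - \rho_i l \right\}.
$$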



Figure 5 summarizes the posterior distribution of
reliability $R(6, s \mid \Theta)$ versus start time $s$ for six-hour
computer runs. The three lines included on this plot
are the 0.05 and 0.95 quantiles and the median of
$R(6, s \mid \Theta)$ with respect to $\pi(\Theta \mid D)$. As $s$ increases,
these three lines increase, indicating reliability growth.
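
The job reliability defined above is the exponential of minus the total expected number of failures across the 48 computers during the job. The sketch below computes it for one parameter configuration; the values of $(\eta_i, \phi_i, \rho_i)$ are hypothetical, taken equal across computers and measured in hours purely for illustration.

```python
import math


def expected_failures(a, b, eta, phi, rho):
    """Integral of the intensity (phi/eta)(t/eta)^(phi-1) + rho over (a, b]."""
    return (b / eta) ** phi - (a / eta) ** phi + rho * (b - a)


def job_reliability(length, start, params):
    """Probability that no computer fails during a job of the given length
    started at age `start`; params holds one (eta, phi, rho) per computer."""
    total = sum(expected_failures(start, start + length, eta, phi, rho)
                for eta, phi, rho in params)
    return math.exp(-total)


# Hypothetical parameters: 48 identical computers with phi < 1.
params = [(5000.0, 0.5, 1e-4)] * 48

r_new = job_reliability(6.0, 0.0, params)
r_old = job_reliability(6.0, 1000.0, params)
print(round(r_new, 3), round(r_old, 3))
```

Because $\phi_i < 1$ in this configuration, the six-hour reliability is higher for the later start time, matching the reliability growth seen in Figure 5.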

While this is a simple system example, it illus- 
trates the power of Bayesian hierarchical models for 
integrating multiple, similar sources of information 
to assess overall system reliability. The next example 
considers a simple system composed of very differ- 
ent components and the combination of system and 
component testing. 

3.4 Lifetime Data 

As a demonstration of the multiple levels of data 
collected on a simple system, consider a system that 
consists of only three components that are all re- 
quired to work so that the system as a whole works. 
An event tree representation of such a system is 
pictured in Figure 6. While this is a simple sys- 
tem, important data combination methods can be 
illustrated. There are four reliability functions of in- 
terest: one for each of the three components and 
one additional reliability function, which is the sys- 
tem reliability function. Furthermore, suppose that 
at each component we conduct $n_i = 20$, $i = 2, 3, 4$,
tests and record the time until failure. We also collect
$n_s = 10$ full system tests independent of the
component data and observe the time until failure. 
Given this system structure and the test data, we 
can explore the features of the proposed Bayesian 
system reliability modeling. 

Fig. 6. Reliability event tree for system reliability.

Goodness-of-fit techniques revealed that a reasonable
model for the distribution of failure times of the
components is Weibull, that is,

$$
f(t \mid \alpha_i, \beta_i) = \frac{\alpha_i}{\beta_i} \left(\frac{t}{\beta_i}\right)^{\alpha_i - 1}
\exp\left\{ -\left(\frac{t}{\beta_i}\right)^{\alpha_i} \right\}, \quad i = 2, 3, 4.
$$



Note that this parameterization of the Weibull distribution
is different from that in Section 3.1, with
$\lambda_0 = (1/\beta_i)^{\alpha_i}$ and $\lambda_1 = \alpha_i$. Here the component
reliabilities $R_i(t \mid \Theta)$ ($i = 2, 3, 4$) are given by $\int_t^\infty f(u \mid \Theta)\, du$,
so that the system reliability at time $t$ is
$R_s(t \mid \Theta) = R_2(t \mid \Theta) R_3(t \mid \Theta) R_4(t \mid \Theta)$. Our prior specification is
that $\alpha_i$ and $\beta_i$ are all exchangeable (i.e., independent
given their prior parameters) and are from a
common gamma distribution, that is,

$$
\begin{aligned}
p(\alpha_i \mid \lambda_a, \zeta_a) &\propto \alpha_i^{\lambda_a - 1} \exp(-\zeta_a \alpha_i), \\
p(\beta_i \mid \lambda_b, \zeta_b) &\propto \beta_i^{\lambda_b - 1} \exp(-\zeta_b \beta_i).
\end{aligned}
$$

Then, to complete the hierarchical specification, we
propose that $\lambda_a$, $\zeta_a$, $\lambda_b$ and $\zeta_b$ have exponential distributions,
each with their own rate parameters.
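
Under this Weibull parameterization each component reliability has the closed form $R_i(t \mid \Theta) = \exp\{-(t/\beta_i)^{\alpha_i}\}$, and the system reliability is the product over components. The sketch below evaluates both and checks the link to the Section 3.1 parameterization; the $(\alpha_i, \beta_i)$ values are hypothetical and the function names are our own.

```python
import math


def weibull_reliability(t, alpha, beta):
    """P(T > t) for a Weibull lifetime with shape alpha and scale beta."""
    return math.exp(-((t / beta) ** alpha))


def to_section_3_1_params(alpha, beta):
    """Convert (alpha, beta) to (lambda_0, lambda_1), where
    P(T > t) = exp(-lambda_0 * t ** lambda_1)."""
    return (1.0 / beta) ** alpha, alpha


def series_reliability(t, components):
    """Series-system reliability: product of the component reliabilities."""
    prob = 1.0
    for alpha, beta in components.values():
        prob *= weibull_reliability(t, alpha, beta)
    return prob


# Hypothetical (alpha_i, beta_i) for components C2, C3 and C4.
components = {2: (1.5, 30.0), 3: (2.0, 40.0), 4: (1.2, 50.0)}

t = 10.0
r_system = series_reliability(t, components)
lam0, lam1 = to_section_3_1_params(*components[2])
print(round(r_system, 3))
```

Since every factor is at most 1, the system reliability at any age can never exceed that of the least reliable component.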

Given the specification above, we use a successive 
substitution MCMC procedure where each compo- 
nent of the joint posterior distribution is updated 
one at a time. The posterior distributions (as a func- 
tion of time) for the reliability function of each of 
the components in the example system are presented 
in Figure 7. They are organized as upper left, the
posterior distribution of the full system $C_1$; upper
right, the posterior distribution for component $C_2$;
lower left, the posterior distribution for component
$C_3$; lower right, the posterior distribution for component
$C_4$.

Fig. 7. Posterior distributions (as a function of time) for the reliability function of each of the components in the system. The upper left panel is the posterior distribution of the full system $C_1$; the upper right panel is the posterior distribution for the component $C_2$; the lower left panel is the posterior distribution for component $C_3$; and the lower right panel is the posterior distribution for component $C_4$.

We note, in particular, that the posterior distri- 
bution on the system reliability function is less vari- 
able than that of any of the components. We only 
used ten system tests and 20 component level tests, 
suggesting that the component testing has improved 
our state of knowledge about the system. Further, 
we note that the improvement does not reflect an 
improvement of the magnitude expected if we add 
60 (or even 20) full system tests. This would result in 
a posterior distribution with much less uncertainty. 
Therefore, the component testing does not inform 
the posterior proportionately to a full system test, 
but it does improve our knowledge and can be par- 
ticularly helpful when full system tests are sparse. 

3.5 Elicitation for Reliability 

The issues around elicitation for reliability fall 
into three categories: elicitation methodology and 
techniques, elicitation for parameters and prior spec- 
ification in reliability models, and elicitation of sys- 
tem structure and dependencies. The first two are 
relatively well studied; the last is an open research 
area. 

Kadane and Wolfson (1998) stated that "the goal 
of elicitation, as we see it, is to make it as easy as 
possible for subject-matter experts to tell us what 
they believe, in probabilistic terms, while reducing 
how much they need to know about probability the- 
ory to do so." There is emerging consensus that the 
following assertions represent good technique for pa- 
rameters and prior elicitation (Kadane and Wolfson, 
1998, page 4): 

1. Experts should be asked to assess only observ- 
able quantities, conditioning only on covariates 
(which are also observable) or other observable 
quantities. 

2. Experts should not be asked to estimate moments 
of a distribution (except possibly the first mo- 
ment); they should be asked to assess quantiles 
or probabilities of the predictive distribution. 

3. Frequent feedback should be given to the expert 
during the elicitation process. The feedback can 
be graphical or verbal, and it should help the 
expert develop coherent probabilities and under- 
stand the implications of previous choices. 



4. Experts should be asked to give assessments both 
unconditionally and conditionally on hypotheti- 
cal observed data. 

The psychological underpinnings of these recom- 
mendations are summarized in Meyer and Booker 
(2001). Specific statistical techniques for deriving 
predictive distributions that are useful in reliabil- 
ity and calculating parameter and hyperparameter 
distributions from elicited quantities can be found 
in Kadane and Wolfson (1998), Percy (2002) and 
Gutierrez-Pulido, Aguirre-Torres and Christen (2005).
More detailed elicitation case studies appear in Keeney 
and von Winterfeldt (1991) and O'Hagan (1998). 

Elicitation of priors for component parameters for 
systems reliability is especially difficult because, given 
a fault tree or reliability block diagram structure, 
the prior distributions for parameters at the com- 
ponents induce prior distributions on the system. 
For example, suppose that we have a series system
with component reliabilities $p_i$ and that we assume a
Uniform(0, 1) prior for each $p_i$. This does not imply
that there is a uniform prior on the system itself.
Given a series system with $k$ components, the prior
distribution on the system is $[\Gamma(k)]^{-1}(-\log p)^{k-1}$,
which has mean $2^{-k}$ (Parker, 1972). If the system
reliability has a Uniform(0, 1) distribution and we
assume that each of the $k$ components has the same
prior distribution, then this prior distribution is
$[\Gamma(1/k)]^{-1}(-\log p)^{(1-k)/k}$, which has mean $2^{-1/k}$.
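
The two induced priors can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it relies on the standard facts that $-\log U$ is Exponential(1) when $U$ is Uniform(0, 1), and that a sum of independent Gamma variables with a common scale is again Gamma. The function names, sample sizes and seeds are assumptions chosen for the example.

```python
import math
import random


def series_prior_density(p, k):
    """Density at p of the product of k independent Uniform(0, 1) reliabilities."""
    return (-math.log(p)) ** (k - 1) / math.gamma(k)


def series_prior_mean(k, draws=100_000, seed=1):
    """Monte Carlo mean of the system reliability under Uniform(0, 1) component priors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        prod = 1.0
        for _ in range(k):
            prod *= rng.random()
        total += prod
    return total / draws


def component_prior_mean(k, draws=100_000, seed=2):
    """Monte Carlo mean of a component reliability exp(-X), X ~ Gamma(1/k, 1).

    With k such components in series, the sum of the X's is Gamma(1, 1),
    so the system reliability is Uniform(0, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        total += math.exp(-rng.gammavariate(1.0 / k, 1.0))
    return total / draws


if __name__ == "__main__":
    k = 3
    print(series_prior_mean(k), 2.0 ** (-k))
    print(component_prior_mean(k), 2.0 ** (-1.0 / k))
```

For $k = 3$ the two simulated means should lie close to $2^{-3}$ and $2^{-1/3}$, respectively, up to Monte Carlo error.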

The elicitation of system structure and depen- 
dencies among components and failure modes is an 
emerging area of research. Neil, Fenton and Nielson 
(2000), Lee (2001) and Wilson, McNamara and Wilson
(2007) discussed the construction of Bayesian
network representations (Section 4) for complex
systems. Seshasai and Gupta (2004) discussed the
modeling of structure and information within the
engineering design process. Klamann and Koehler (2005)
proposed qualitative methods for the determination 
of system structure. The issues are the determina- 
tion of the correct granularity for representing com- 
ponents and functionality, and the appropriate de- 
pendencies among the components, functions and 
failure modes. Qualitative models of systems that 
capture these features underlie the development of 
quantitative statistical models for systems reliabil- 
ity. 

4. REPRESENTING SYSTEMS 

Fault trees and reliability block diagrams are the
most common representations in system reliability
analysis. However, there are situations where these
models do not offer enough flexibility to capture
features of the system. Bayesian networks generalize
fault trees and reliability block diagrams by allowing
components and subsystems to be related by
conditional probabilities instead of deterministic AND
and OR relationships. Flowgraph models are
multistate models that simplify the analysis of
time-to-event data.

Fig. 8. Specifying joint probability distributions using a
Bayesian network. (a) Serial: P(A, B, C) = P(C | B)P(B | A)P(A).
(c) Diverging: P(A, B, C) = P(C | A)P(B | A)P(A).

4.1 Bayesian Networks

There is a growing literature on the use of Bayesian
networks (BNs) in reliability (e.g., Portinale, Bobbio
and Montani, 2005; Sigurdsson, Walls and Quigley,
2001; Lee, 2001), and a considerably broader
literature on using BNs for probabilistic modeling
(e.g., Spiegelhalter, 1998; Neil, Fenton and Nielson,
2000; Laskey and Mahoney, 2000; Jensen, 2001).

Formally, a BN is a pair $N = ((V, E), P)$, where
$(V, E)$ are the nodes and edges of a directed acyclic
graph and $P$ is a probability distribution on $V$.
Each node contains a random variable, and the di- 
rected edges between them define conditional depen- 
dences/independences among the random variables. 
Figure 8 summarizes the three probabilistic relation- 
ships that can be specified in a BN. The key feature 
of a BN is that it specifies the joint distribution P 
over the set of nodes V in terms of conditional dis- 
tributions. In particular, the joint distribution of V 
is given by 

$$P(V) = \prod_{v \in V} P(v \mid \mathrm{parents}[v]),$$

where the parents of a node are the set of nodes with 
an edge pointing to the node. For example, in the 
serial structure in Figure 8(a), the parent of node C 
is node B, and node A has no parents. 
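
As a minimal sketch of this factorization, the following code builds the joint distribution for the serial structure A -> B -> C from its conditional distributions. The binary coding of the nodes and all numerical probabilities are hypothetical values chosen for illustration; they do not come from the paper.

```python
from itertools import product

# Hypothetical conditional probability tables for binary nodes (1 or 0).
P_A = {1: 0.9, 0: 0.1}
P_B_GIVEN_A = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}}    # P_B_GIVEN_A[a][b]
P_C_GIVEN_B = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.1, 0: 0.9}}  # P_C_GIVEN_B[b][c]


def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(c | b) P(b | a) P(a) for the serial structure."""
    return P_A[a] * P_B_GIVEN_A[a][b] * P_C_GIVEN_B[b][c]


def marginal_c(c):
    """P(C=c), obtained by summing the joint over the other nodes."""
    return sum(joint(a, b, c) for a, b in product((0, 1), repeat=2))


if __name__ == "__main__":
    total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
    print("joint sums to", total)
    print("P(C = 1) =", marginal_c(1))
```

Any marginal or conditional probability of interest follows from the joint in the same way, by summing over the nodes that are not of interest.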

Bayesian networks can be used as a direct generalization
of fault trees. The fault tree translation to
a BN is straightforward, with the basic events that 
contribute to an intermediate event represented as 
parents and a child. Figure 9 shows the correspon- 
dence between a fault tree AND gate and a BN con- 
verging structure. Notice that a fault tree implies 
specific conditional probabilities. The same BN con- 
verging structure works for an OR gate, with the 
appropriate conditional probabilities. 
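
The correspondence can be sketched in code as follows. This is an illustration under stated assumptions rather than the construction used in the paper: events are coded 1 when they occur, the two parent events are taken to be independent, and the function names and example probabilities are hypothetical.

```python
from itertools import product


def gate_cpt(gate):
    """Return P(C = 1 | A = a, B = b) for a deterministic AND or OR gate."""
    if gate == "AND":
        return {(a, b): float(a and b) for a, b in product((0, 1), repeat=2)}
    if gate == "OR":
        return {(a, b): float(a or b) for a, b in product((0, 1), repeat=2)}
    raise ValueError("gate must be 'AND' or 'OR'")


def prob_child(cpt, p_a, p_b):
    """P(C = 1) in the converging structure A -> C <- B with independent parents."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        weight_a = p_a if a else 1.0 - p_a
        weight_b = p_b if b else 1.0 - p_b
        total += cpt[(a, b)] * weight_a * weight_b
    return total


if __name__ == "__main__":
    p_a, p_b = 0.1, 0.2  # hypothetical basic-event probabilities
    print("AND:", prob_child(gate_cpt("AND"), p_a, p_b))
    print("OR: ", prob_child(gate_cpt("OR"), p_a, p_b))
```

Replacing the 0 and 1 entries of the table with intermediate values gives a BN that no longer corresponds to a deterministic gate, which is the added flexibility described above.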

Suppose that we have the BN from Figure 10 and 
suppose that we are interested in calculating the 
posterior probability for each component and the 
full system. Hamada et al. (2004) discussed how to 
approach this problem for the special case of fault 
trees. 





Fig. 9. Fault tree conversion to a Bayesian network. The BN panel lists the conditional probabilities P(C = 1 | A, B) implied by the gate.

Fig. 10. Bayesian network generalization of the system example.

Suppose that we have the same data as given in 
Section 3.1. However, instead of a series system, we 
have the relationships 
Again, drawing from Section 3.1, let
$p_1(t) = \mathrm{logit}^{-1}(\theta_0 + \theta_1 t)$,
$p_2(t) = \exp(-\lambda_0 t^{\gamma_0})$ and
$p_3(t) = 1 - \Phi(\{\log t - \mu - \log(\alpha - L')\}/\sigma_b)$.
Then the system data can be modeled as Binomial($p_S(t)$), where

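$$
\begin{aligned}
p_S(t) ={}& 1.0\,p_1(t)p_2(t)p_3(t) \\
&+ 0.4(1 - p_1(t))p_2(t)p_3(t) \\
&+ 0.3\,p_1(t)(1 - p_2(t))p_3(t) \\
&+ 0.5\,p_1(t)p_2(t)(1 - p_3(t)) \\
&+ 0.1(1 - p_1(t))(1 - p_2(t))p_3(t) \\
&+ 0.05\,p_1(t)(1 - p_2(t))(1 - p_3(t)) \\
&+ 0.25(1 - p_1(t))p_2(t)(1 - p_3(t)).
\end{aligned}
$$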

Here we omit the dependence on for space rea- 
sons. Figure 11 shows the posterior distribution for 
system reliability: the posterior mean is solid and 
the 5th and 95th percentiles are dotted. 
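
The system success probability used in this example can be written as a weighted sum over the joint states of the three components. The sketch below is illustrative: it reads the weights as conditional probabilities of system success given the component states, it assumes a weight of 1.0 when all three components work (the leading coefficient, which is an assumption here) and 0 when none work, and the function name is hypothetical.

```python
from itertools import product

# P(system works | x1, x2, x3), where x_i = 1 means component i works.
# The (1, 1, 1) and (0, 0, 0) entries are assumptions of this sketch.
SYSTEM_GIVEN_COMPONENTS = {
    (1, 1, 1): 1.0,
    (0, 1, 1): 0.4,
    (1, 0, 1): 0.3,
    (1, 1, 0): 0.5,
    (0, 0, 1): 0.1,
    (1, 0, 0): 0.05,
    (0, 1, 0): 0.25,
    (0, 0, 0): 0.0,
}


def system_reliability(p1, p2, p3):
    """Mix the conditional probabilities over independent component states."""
    probs = (p1, p2, p3)
    total = 0.0
    for state in product((0, 1), repeat=3):
        weight = 1.0
        for x, p in zip(state, probs):
            weight *= p if x else 1.0 - p
        total += SYSTEM_GIVEN_COMPONENTS[state] * weight
    return total


if __name__ == "__main__":
    # Hypothetical component reliabilities at a single time point.
    print(system_reliability(0.95, 0.90, 0.85))
```

Evaluating this function at each posterior draw of the component reliabilities, for each time of interest, would give draws of the system reliability from which a mean and percentile bands can be computed.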

The example given above can easily be general- 
ized to the situation where the conditional proba- 
bilities are not known, but are described by a distri- 
bution. Neil, Fenton and Nielson (2000) and Wil- 
son, McNamara and Wilson (2007) discussed the 
construction of system models for BNs in detail. 
For additional examples, see Farrow, Goldstein and
Spiropoulos (1997), Bedford and Cooke (2001) and
Portinale, Bobbio and Montani (2005).

Fig. 11. Reliability estimates with uncertainty bands for the Bayesian network example. The solid curve is the posterior
mean and the dotted curves are the 5th and 95th percentiles of the posterior distribution.

Fig. 12. Flowgraph model for a series pump system.

4.2 Flowgraph Models 

Flowgraph models offer another representation that 
can be useful in solving reliability problems. A
flowgraph model is one type of multistate model. It is
useful for capturing potential outcomes, probabilities
of outcomes and waiting times for outcomes to
occur, and is often used to model time-to-event data.
Like a graphical model, a flowgraph consists of nodes 
and arcs. However, in a flowgraph model, the nodes 
(or states) represent outcomes. This differs from a 
BN, where the nodes represent random variables. 

Consider Figure 12 from Huzurbazar (2005). This 
flowgraph models the states of a pump system with 
two pumps. The pumps operate independently and 
the system can operate with one pump if necessary. 
The nodes of the flowgraph represent states of the
system: state 0 represents no failed pumps, state 1
represents one failed pump and state 2 represents two
failed pumps. One quantity of interest is the time
to total failure, or the total time to transition from
state 0 to state 1 to state 2.

The directed line segments in a flowgraph are bran- 
ches. Each branch has a transition probability and 
waiting time distribution associated with a transi- 
tion from its beginning to ending nodes. The branches 
are labeled with transmittances, each of which is 
the transition probability multiplied by the moment
generating function of the waiting time distribution.
For example, in Figure 12, the transition probability
from state 0 to state 1 is 1.0 and the moment
generating function of the waiting time distribution is
$M_{01}(s)$. In Figure 13, the transition probability from
state 1 to state 0 is $p_{10}$ and from state 1 to state 2
is $p_{12}$, where $p_{10} + p_{12} = 1$. In this example, there
is no probability of staying in state 1; eventually a
transition always occurs.

Suppose that in Figure 12 the pumps fail independently,
each with an exponential failure time distribution
with mean $1/\lambda_0$, denoted Exponential($\lambda_0$).
The transition from state 0
to state 1 happens when either of the pumps fails,
which means that its waiting time is the minimum
of two independent exponential random variables, which
has an Exponential($2\lambda_0$) distribution.

Once in state 1, we assume that the remaining
pump has a failure time with an Exponential($\lambda_1$)
distribution, with $\lambda_1 > \lambda_0$ to account for
the extra stress on the pump once the
first has failed. The waiting time to transition from
state 0 to state 1 to state 2 is the sum of two independent
exponential waiting times. Since the waiting times
are independent, the moment generating function
of their sum is the product of the individual
moment generating functions. The moment generating function
for an Exponential($\lambda$) is $M(s) = \lambda/(\lambda - s)$ for $s < \lambda$.
The moment generating function for the transition
from state 0 to state 2 is

$$M_{02}(s) = M_{01}(s)M_{12}(s).$$

Fig. 13. Flowgraph model for a series pump system with feedback.

This moment generating function uniquely determines
the distribution of the waiting time. Since
we can now write an equivalent flowgraph with the
transmittance from state 0 to state 2, we have solved
the flowgraph from 0 to 2. Huzurbazar (2005) gave
a general algorithm based on Mason's rule to solve 
flowgraphs like those in Figures 12 and 13. The mo- 
ment generating functions in the transmittances can 
be converted to probability density functions (or 
other summaries, like reliability or hazard functions) 
either analytically or using saddlepoint approxima- 
tion techniques. Huzurbazar (2000, 2005) gave a num- 
ber of examples of flowgraph models, solving flow- 
graphs and inverting flowgraph moment generating 
functions. 
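
For the two-pump flowgraph of Figure 12, the solved transmittance can be inverted analytically. The following sketch is not taken from Huzurbazar (2000, 2005): it multiplies the two exponential moment generating functions, recovers the mean time to total failure by numerical differentiation at zero, and uses the standard closed-form survival function of a sum of two independent exponential waiting times. The function names and rate values are assumptions made for illustration.

```python
import math


def mgf_exponential(s, rate):
    """Moment generating function of an Exponential(rate) waiting time, s < rate."""
    if s >= rate:
        raise ValueError("the MGF is defined only for s < rate")
    return rate / (rate - s)


def mgf_total_failure(s, lam0, lam1):
    """M_02(s) = M_01(s) M_12(s) with rates 2*lam0 and lam1."""
    return mgf_exponential(s, 2.0 * lam0) * mgf_exponential(s, lam1)


def mean_time_to_failure(lam0, lam1, h=1e-5):
    """First moment, from a central-difference derivative of the MGF at s = 0."""
    upper = mgf_total_failure(h, lam0, lam1)
    lower = mgf_total_failure(-h, lam0, lam1)
    return (upper - lower) / (2.0 * h)


def reliability_total_failure(t, lam0, lam1):
    """P(time from state 0 to state 2 exceeds t)."""
    a, b = 2.0 * lam0, lam1
    if math.isclose(a, b):
        # Equal rates: the sum of the two waiting times is Erlang with shape 2.
        return math.exp(-a * t) * (1.0 + a * t)
    return (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)


if __name__ == "__main__":
    lam0, lam1 = 0.1, 0.3  # hypothetical failure rates with lam1 > lam0
    print(mean_time_to_failure(lam0, lam1), 1 / (2 * lam0) + 1 / lam1)
    print(reliability_total_failure(10.0, lam0, lam1))
```

For flowgraphs with feedback, such as the one in Figure 13, a closed form of this kind is generally not available, which is where the saddlepoint techniques mentioned above become useful.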

5. RESOURCE ALLOCATION 

Sections 2 and 3 considered the analysis of vari- 
ous sources of component data and a mix of com- 
ponent and system data, respectively, to assess sys- 
tem reliability. In this section, we address how to 
allocate limited testing resources; we simply refer to 
this problem as resource allocation (Hamada et al., 
2004). That is, given a limited budget, where should 
additional tests be done (at the system level and/or 
the component level) and how many tests should be 
done there? 

First, we assume that there is a cost for collecting 
additional data and that it is more costly to collect 
higher level (e.g., system) data than lower level (e.g., 
component) data. For specified costs, the total cost
of a candidate allocation, that is, of the number of
tests at the system level and at each of the components,
must not exceed a fixed budget.

Next, we need a criterion with which to evaluate 
a candidate allocation and to compare two differ- 
ent candidate allocations. We use one based on re- 
peated pre-posterior analyses. The fact that we can 
analyze the varied data presented in Sections 2 and 
3 allows us to take such an approach. The criterion 
can be described operationally as follows. We draw 
from the parameter prior distribution (based on the 
existing data), simulate data according to the can- 
didate allocation using the current prior draw as the 
true parameters and then update with the simulated 
data to obtain the parameter posterior distribution. 
Using this parameter posterior distribution, we
evaluate the system reliability posterior distribution and
record some distributional characteristic. For example,
we often use the length of the central 90% credible
interval as a measure of uncertainty that we
would like to reduce. Repeating this procedure
produces an empirical distribution of posterior credible
interval lengths. As a criterion for the candidate
allocation, we use an upper quantile, for example, the
0.90 quantile.

Fig. 14. Event tree for a two-component series system.
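
The pre-posterior criterion described above can be sketched for a deliberately simplified case. The code below is an assumption-laden illustration, not the R and YADAS implementation used by the authors: it considers component tests only, uses conjugate Beta updating in place of MCMC, and takes the system reliability to be the product of the component reliabilities. The Beta parameters, allocation values and function names are hypothetical.

```python
import random


def empirical_quantile(sorted_values, q):
    """Simple empirical quantile of an already sorted list."""
    index = min(len(sorted_values) - 1, max(0, int(q * len(sorted_values))))
    return sorted_values[index]


def interval_length(draws, level=0.90):
    """Length of the central credible interval computed from posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return empirical_quantile(ordered, 1.0 - tail) - empirical_quantile(ordered, tail)


def preposterior_criterion(prior, allocation, n_datasets=200, n_draws=1000,
                           upper_q=0.90, seed=0):
    """Upper quantile of posterior interval lengths for a candidate allocation.

    prior: list of (a, b) Beta parameters, one pair per component.
    allocation: number of additional binomial tests for each component.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_datasets):
        posterior = []
        for (a, b), n in zip(prior, allocation):
            p_true = rng.betavariate(a, b)                     # draw "true" reliability
            y = sum(rng.random() < p_true for _ in range(n))   # simulate test successes
            posterior.append((a + y, b + n - y))               # conjugate Beta update
        system_draws = []
        for _ in range(n_draws):
            reliability = 1.0
            for a, b in posterior:
                reliability *= rng.betavariate(a, b)
            system_draws.append(reliability)
        lengths.append(interval_length(system_draws))
    return empirical_quantile(sorted(lengths), upper_q)


if __name__ == "__main__":
    prior = [(6.0, 1.0), (10.0, 2.0)]  # hypothetical current state of knowledge
    for allocation in [(0, 0), (25, 25), (50, 0), (0, 50)]:
        print(allocation, round(preposterior_criterion(prior, allocation), 3))
```

Smaller criterion values indicate allocations that are expected to leave less uncertainty about system reliability; an optimizer would search over allocations that satisfy the budget constraint.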

Finally, we need to find the candidate allocation 
that optimizes, in this case minimizes, the criterion. 
To solve the optimization problem, we can use a
genetic algorithm (GA) (Goldberg, 1989). We have
implemented a GA for resource allocation in R (R
Development Core Team, 2004), which generates the
candidate allocations. A candidate allocation is also 
evaluated in R by repeatedly generating data sets 
and calling YADAS (Graves, 2003, 2007) to do the 
Bayesian updating. YADAS produces an output file 
of parameter posterior draws that is read into R to 
calculate the candidate allocation criterion. 

In the remainder of this section, we consider the 
case where there are only binomial count data at the 
system and component levels. We illustrate resource 
allocation for a simple series system that consists 
of two components as displayed in Figure 14 as an 
event tree. 

Johnson, Graves, Hamada and Reese (2003) showed 
how to combine system and component level bino- 
mial data in a reliability assessment. For example, 
if the series system structure in Figure 14 is valid 
and the components are independent, then the system
reliability $p_1$ equals $p_2 p_3$, the product of the two
component reliabilities. Consequently, system level 
data are informative about the component reliabili- 
ties through this relationship. 
Let $TC_i$ be the corresponding costs: $TC_1$ is the
cost of a system test, and $TC_2$ and $TC_3$ are the
component costs. Let $n_i$ be the corresponding number
of tests so that for budget $B$, $\sum_{i=1}^{3} TC_i n_i \le B$.

Under this scenario where the system structure
(i.e., series) holds and binomial count data are
collected, resource allocation depends on these costs.
If $TC_1 > TC_2 + TC_3$, then the optimal allocation
will consist only of component tests. Even if $TC_1 =
TC_2 + TC_3$, there is still more information gained
from individual component tests than from one system
test. That is, one system test provides less
information than testing each component once.
If $TC_1 = TC_2 = TC_3$, then the optimal
allocation is all system tests. If $TC_1 < TC_2 + TC_3$,
there must be a mixture of system and component
tests, but characterizing this mixture remains an
open problem.

An important reason for performing system tests
is that they are integrative, a feature not accounted
for in the discussion above. That is, does the system
work when all the components are assembled?
Consequently, system tests are needed to assess the
assumed structure. The previously stated relationship
between the system and component reliabilities
for the simple series system,

$$p_1 = p_2 p_3,$$

assumes that the series structure with independence 
holds. To allow for the possibility that the assumed 
structure does not hold perfectly, consider the rela- 
tionship 

(3) $p_1 = p_2 p_3/[p_2 p_3 + (1 - p_2 p_3)\exp(-\beta)]$.

Here $\beta$ is a bias term for which $\beta = 0$ means that
the series structure with independent components
holds; also, for $\beta < 0$, $p_1 < p_2 p_3$ and for $\beta > 0$,
$p_1 > p_2 p_3$. Note that if a specific departure from the
assumed system structure is of interest, the departure
can be accommodated. For example, if there is a pos- 
sible additional failure mode due to common causes, 
the relationship given in Mosleh (1991) can be used. 
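
Relationship (3) is equivalent to a shift of $\beta$ on the logit scale, that is, $\mathrm{logit}(p_1) = \mathrm{logit}(p_2 p_3) + \beta$, which follows by rearranging (3). The short sketch below illustrates this numerically; the function names and the reliability values are hypothetical.

```python
import math


def logit(p):
    """Log odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def system_reliability_with_bias(p2, p3, beta):
    """Relationship (3): system reliability with bias term beta."""
    q = p2 * p3
    return q / (q + (1.0 - q) * math.exp(-beta))


if __name__ == "__main__":
    p2, p3 = 0.9, 0.8  # hypothetical component reliabilities
    for beta in (-1.0, 0.0, 1.0):
        p1 = system_reliability_with_bias(p2, p3, beta)
        print(beta, round(p1, 4), round(logit(p1) - logit(p2 * p3), 4))
```

The last printed column reproduces $\beta$, so uncertainty about $\beta$ carries over directly to uncertainty about the log odds of system success.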

For resource allocation, we see from (3) that to
reduce the uncertainty about the system reliability
$p_1$, the uncertainty about the bias term $\beta$ also needs to
be reduced. Doing so requires some system tests.

Consider resource allocation when the assumed 
system structure may not hold for the following prob- 
lem: 

• The existing data consist of 2 system tests (both 
successes), 5 component 1 tests (5 successes) and 
10 component 2 tests (9 successes). 



• Prior distributions on the component reliabilities
and bias term $\beta$ are taken to be diffuse. Combined
with the existing data listed above, the 90% credible
intervals based on the resource allocation prior
distributions are (0.83, 1.00) for the component 1
reliability, (0.77, 0.98) for the component 2
reliability and (-1.56, 2.75) for the bias term.

• The resulting resource allocation prior distribu- 
tion for the system reliability has a 90% credi- 
ble interval of (0.579, 0.992). Consequently, the 
length of the initial 90% credible interval for sys- 
tem reliability is 0.413. 

For a budget $B = 2500$ and costs $TC_1 = 30$ and
$TC_2 = TC_3 = 1$, the optimal allocation (based on
evaluating 2000 candidate allocations using a GA)
is to do as many system tests (i.e., 83) as possible.
An allocation of $(n_1, n_2, n_3) = (83, 10, 0)$ yields
a criterion value of 0.160 based on 1000 generated data sets
with 10,000 posterior draws per data analysis. For
this case, we see that in spite of the system test cost
being much larger than the component test costs,
essentially the entire budget is spent on system tests.
If initially there is less uncertainty about the bias
term, we expect the optimal allocation to mix
system and component tests; recall that the no-bias
case presented above allocated the budget entirely
to component tests.

More study of resource allocation for more
complicated systems is needed. In the meantime, for
a specific situation one needs to employ an
optimization algorithm such as the GA we used in this
example to find an optimal allocation.

6. DISCUSSION 

In this paper, we hope that we have conveyed the 
importance of the role that statisticians can play 
in assessing system reliability today and the many 
research challenges that it presents. Somewhat face- 
tiously, we thought of titling this paper "This ain't 
your father's reliability!" or "System reliability 
assessment — a statistician's playground!," because 
both express the excitement that we have about the 
research challenges in this field. 

As Sections 2 and 3 showed, novel statistical models
arise when statisticians want to bring information
from all available data to bear in an assessment.
Even assessing a single component can be challeng- 
ing when the data are from computer experiments 
(Santner, Williams and Notz, 2003) in which veri- 
fication, validation and calibration need to be ad- 
dressed, and where multiscale physical experiments 
and historical system tests with multiple measurement
errors must be integrated. Moreover, as much
engineering and science knowledge as possible needs 
to be incorporated into the statistical modeling. 

Section 3 presented some of the challenges in in- 
corporating multilevel data from various sources in 
an assessment. Another example occurs when the 
data come from different tests at different levels, 
some of which are done at more severe conditions 
than those experienced in normal use (Reese, Hamada 
and Robinson, 2005). Section 3.5 also discussed elic- 
itation of expert knowledge. This is critical in cap- 
turing both the functional and physical structure of 
a system, and more research is needed on techniques 
and tools for carrying out this activity. 

In Section 4, richer representations than fault trees 
and reliability block diagrams were presented. More 
research is needed on statistical inference with these 
representations. Section 5 presented the emerging 
problem of resource allocation. There are many in- 
teresting problems beyond the binomial case. For 
example, in an accelerated degradation data experi- 
ment on a single component, one needs to determine 
how much of the budget should be spent on this ex- 
periment and subsequently, what levels of the accel- 
erating variable should be studied, how many units 
should be tested at each level and how often the 
units should be inspected. There is much research 
here that remains to be done. 

Implementing reliability assessment for large
systems is challenging, and developing the necessary
tools is a research effort in itself. Our organization
(Statistical Sciences Group at Los Alamos National
Laboratory) is developing qualitative system
representation tools such as GROMIT (Klamann and Koehler,
2005) and statistical modeling tools such as YADAS 
(Graves, 2003, 2007), as well as an interface between 
them. However, many challenges remain. For exam- 
ple, system reliability assessments are computation- 
ally intensive. What approximations can be incor- 
porated without sacrificing accuracy? Do we need 
the power of a supercomputer? Resource allocation 
is even more computationally intensive and brings 
the issues of computation to the forefront. 